datapath <- "/Users/eampo/Desktop/Stats_Final_Project/"
AssignmentData <- read.csv(file=paste(datapath,"regressionassignmentdata2014.csv",sep="/"), row.names=1,header=TRUE,sep=",")
head(AssignmentData)
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR Output1
## 1/5/1981 13.52 13.09 12.289 12.28 12.294 12.152 11.672 18.01553
## 1/6/1981 13.58 13.16 12.429 12.31 12.214 12.112 11.672 18.09140
## 1/7/1981 14.50 13.90 12.929 12.78 12.614 12.382 11.892 19.44731
## 1/8/1981 14.76 14.00 13.099 12.95 12.684 12.352 11.912 19.74851
## 1/9/1981 15.20 14.30 13.539 13.28 12.884 12.572 12.132 20.57204
## 1/12/1981 15.22 14.23 13.179 12.94 12.714 12.452 12.082 20.14218
## Easing Tightening
## 1/5/1981 NA NA
## 1/6/1981 NA NA
## 1/7/1981 NA NA
## 1/8/1981 NA NA
## 1/9/1981 NA NA
## 1/12/1981 NA NA
tail(AssignmentData)
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR
## 6/19/2014 0.0101 0.0456 0.4477 0.9255 1.6751 2.6206 3.4660
## 6/20/2014 0.0101 0.0355 0.4561 0.9256 1.6804 2.6052 3.4337
## 6/23/2014 0.0051 0.0355 0.4643 0.9470 1.7021 2.6261 3.4549
## 6/24/2014 0.0203 0.0507 0.4563 0.9311 1.6689 2.5781 3.3991
## 6/25/2014 0.0203 0.0456 0.4804 0.9124 1.6540 2.5592 3.3815
## 6/26/2014 0.0253 0.0507 0.4646 0.8803 1.6462 2.5214 3.3458
## Output1 Easing Tightening
## 6/19/2014 -11.70638 NA NA
## 6/20/2014 -11.71994 NA NA
## 6/23/2014 -11.68766 NA NA
## 6/24/2014 -11.73219 NA NA
## 6/25/2014 -11.74946 NA NA
## 6/26/2014 -11.79219 NA NA
After reading in the data, there appear to be 7 input variables: daily records of U.S. Treasury yields to maturity from 1/5/1981 to 6/26/2014. The data also include an output variable of unknown meaning (Output1) and two indicator variables, ‘Easing’ and ‘Tightening’.
matplot(AssignmentData[,-c(8,9,10)],type='l')
After plotting the 7 input variables, the overall trend is that U.S. Treasury yields to maturity are decreasing. ‘Yield to maturity’ is the internal rate of return (IRR) earned by an investor who buys the bond today at market price, assuming the bond is held until maturity. Treasuries have traditionally been thought of as a safe investment because they are backed by the U.S. government. The returns on these bonds can function as economic indicators: if yields are high, investors are confident they can find other investments with better returns, while low yields suggest investors are settling for these safe investments. Treasury yields also have an impact on interest rates, particularly in real estate.
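To make the yield-to-maturity definition concrete, here is a minimal sketch (with made-up bond terms, not taken from the assignment data) that solves for the IRR using base R's `uniroot()`:

```r
# Hypothetical bond: $100 face value, 5% annual coupon, 10 years, priced at $95.
price <- 95; face <- 100; coupon <- 5; years <- 10
# Pricing error as a function of a candidate yield y; the YTM is its root.
pv.gap <- function(y) sum(coupon / (1 + y)^(1:years)) + face / (1 + y)^years - price
ytm <- uniroot(pv.gap, c(1e-6, 1), tol = 1e-9)$root
round(100 * ytm, 2)  # YTM in percent; a bit above the 5% coupon since price < face
```

Because the bond trades below par, the solved yield comes out slightly above the coupon rate.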
matplot(AssignmentData[,-c(9,10)],type='l')
The output variable is higher than the input variables in the early 1980s and has declined ever since. It begins to even out with the input variables later in the 1980s, then drops below them and stays below for the remainder of the series.
boxplot(AssignmentData$USGG3M, main="USGG3M")
boxplot(AssignmentData$USGG6M, main="USGG6M")
boxplot(AssignmentData$USGG2YR, main="USGG2YR")
boxplot(AssignmentData$USGG3YR, main="USGG3YR")
boxplot(AssignmentData$USGG5YR, main="USGG5YR")
boxplot(AssignmentData$USGG10YR, main="USGG10YR")
boxplot(AssignmentData$USGG30YR, main="USGG30YR")
cor1 <- cor(AssignmentData$USGG3M, AssignmentData$Output1)
cor2 <- cor(AssignmentData$USGG6M, AssignmentData$Output1)
cor3 <- cor(AssignmentData$USGG2YR, AssignmentData$Output1)
cor4 <- cor(AssignmentData$USGG3YR, AssignmentData$Output1)
cor5 <- cor(AssignmentData$USGG5YR, AssignmentData$Output1)
cor6 <- cor(AssignmentData$USGG10YR, AssignmentData$Output1)
cor7 <- cor(AssignmentData$USGG30YR, AssignmentData$Output1)
Input.correlation <- c(cor1, cor2, cor3, cor4, cor5, cor6, cor7)
Input.correlation.matrix <- matrix(data = Input.correlation, nrow = 7, byrow = FALSE)
rownames(Input.correlation.matrix) <- c("Input.1","Input.2","Input.3","Input.4","Input.5","Input.6","Input.7")
colnames(Input.correlation.matrix) <- c("Correlation")
Input.correlation.matrix
## Correlation
## Input.1 0.9812265
## Input.2 0.9871111
## Input.3 0.9983139
## Input.4 0.9989602
## Input.5 0.9958368
## Input.6 0.9844737
## Input.7 0.9671263
All input variables are highly correlated with the output variable, with each correlation greater than 0.967. The greatest correlation comes from Input 4, the USGG3YR variable.
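The seven separate `cor()` calls above can be collapsed into one `sapply()`; a sketch on a small synthetic data frame standing in for AssignmentData (so the block is self-contained):

```r
# Compact equivalent of cor1..cor7: one cor() per input column via sapply().
set.seed(1)
toy <- data.frame(In1 = rnorm(100), In2 = rnorm(100), In3 = rnorm(100))
toy$Output1 <- toy$In1 + 0.5 * toy$In2 + rnorm(100, sd = 0.1)
cors <- sapply(toy[, 1:3], function(x) cor(x, toy$Output1))
round(cors, 3)
```

On the assignment data the same pattern would be `sapply(AssignmentData[, 1:7], function(x) cor(x, AssignmentData$Output1))`.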
Input1.linear.Model <- lm(Output1~USGG3M, AssignmentData)
Input2.linear.Model <- lm(Output1~USGG6M, AssignmentData)
Input3.linear.Model <- lm(Output1~USGG2YR, AssignmentData)
Input4.linear.Model <- lm(Output1~USGG3YR, AssignmentData)
Input5.linear.Model <- lm(Output1~USGG5YR, AssignmentData)
Input6.linear.Model <- lm(Output1~USGG10YR, AssignmentData)
Input7.linear.Model <- lm(Output1~USGG30YR, AssignmentData)
Coefficients.Input1.Matrix <- t(sapply(1:7, function(x) lm(AssignmentData[,8]~AssignmentData[,x])$coefficients))
Coefficients.Input1.Matrix
## (Intercept) AssignmentData[, x]
## [1,] -11.72318 2.507561
## [2,] -12.09753 2.497235
## [3,] -13.05577 2.400449
## [4,] -13.86162 2.455793
## [5,] -15.43665 2.568742
## [6,] -18.06337 2.786991
## [7,] -21.08590 3.069561
matplot(AssignmentData[,8],type="l",xaxt="n",xlab = "Input1")
lines(Input1.linear.Model$fitted.values,col="red")
matplot(AssignmentData[,8],type="l",xaxt="n",xlab = "Input2")
lines(Input2.linear.Model$fitted.values,col="red")
matplot(AssignmentData[,8],type="l",xaxt="n",xlab = "Input3")
lines(Input3.linear.Model$fitted.values,col="red")
matplot(AssignmentData[,8],type="l",xaxt="n",xlab = "Input4")
lines(Input4.linear.Model$fitted.values,col="red")
matplot(AssignmentData[,8],type="l",xaxt="n",xlab = "Input5")
lines(Input5.linear.Model$fitted.values,col="red")
matplot(AssignmentData[,8],type="l",xaxt="n",xlab = "Input6")
lines(Input6.linear.Model$fitted.values,col="red")
matplot(AssignmentData[,8],type="l",xaxt="n",xlab = "Input7")
lines(Input7.linear.Model$fitted.values,col="red")
At first glance, the linear models that fit best are those on USGG2YR and USGG3YR, with USGG3YR fitting best. This is consistent with the previous finding that USGG3YR has the highest correlation with the output variable.
c(Total.Variance=var(AssignmentData[,8]),Unexplained.Variance=summary(Input7.linear.Model)$sigma^2,Explained.Variance=var(AssignmentData[,8])-summary(Input7.linear.Model)$sigma^2)
## Total.Variance Unexplained.Variance Explained.Variance
## 76.804438 4.967286 71.837152
Coefficients.Input2.Matrix <- t(sapply(1:7, function(x) lm(AssignmentData[,x]~AssignmentData[,8])$coefficients))
Coefficients.Input2.Matrix
## (Intercept) AssignmentData[, 8]
## [1,] 4.675134 0.3839609
## [2,] 4.844370 0.3901870
## [3,] 5.438888 0.4151851
## [4,] 5.644458 0.4063541
## [5,] 6.009421 0.3860610
## [6,] 6.481316 0.3477544
## [7,] 6.869355 0.3047124
Make the easing column equal to 0 during the easing periods and NA otherwise. Make the tightening column equal to 1 during the tightening periods and NA otherwise.
AssignmentDataLogistic <- data.matrix(AssignmentData,rownames.force="automatic")
EasingPeriods <- AssignmentDataLogistic[,9]
EasingPeriods[AssignmentDataLogistic[,9]==1] <- 0
TighteningPeriods <- AssignmentDataLogistic[,10]
cbind(EasingPeriods,TighteningPeriods)[c(550:560,900:910,970:980),]
## EasingPeriods TighteningPeriods
## 3/29/1983 NA NA
## 3/30/1983 NA NA
## 3/31/1983 NA NA
## 4/4/1983 NA 1
## 4/5/1983 NA 1
## 4/6/1983 NA 1
## 4/7/1983 NA 1
## 4/8/1983 NA 1
## 4/11/1983 NA 1
## 4/12/1983 NA 1
## 4/13/1983 NA 1
## 8/27/1984 NA 1
## 8/28/1984 NA 1
## 8/29/1984 NA 1
## 8/30/1984 NA 1
## 8/31/1984 NA 1
## 9/4/1984 NA NA
## 9/5/1984 NA NA
## 9/6/1984 NA NA
## 9/7/1984 NA NA
## 9/10/1984 NA NA
## 9/11/1984 NA NA
## 12/10/1984 0 NA
## 12/11/1984 0 NA
## 12/12/1984 0 NA
## 12/13/1984 0 NA
## 12/14/1984 0 NA
## 12/17/1984 0 NA
## 12/18/1984 0 NA
## 12/19/1984 0 NA
## 12/20/1984 0 NA
## 12/21/1984 0 NA
## 12/24/1984 0 NA
Remove the periods of neither easing nor tightening.
All.NAs <- is.na(EasingPeriods)&is.na(TighteningPeriods)
AssignmentDataLogistic.EasingTighteningOnly <- AssignmentDataLogistic
AssignmentDataLogistic.EasingTighteningOnly[,9] <- EasingPeriods
AssignmentDataLogistic.EasingTighteningOnly <- AssignmentDataLogistic.EasingTighteningOnly[!All.NAs,]
AssignmentDataLogistic.EasingTighteningOnly[is.na(AssignmentDataLogistic.EasingTighteningOnly[,10]),10]<-0
matplot(AssignmentDataLogistic.EasingTighteningOnly[,-c(9,10)],type="l",ylab="Data and Binary Fed Mode")
lines(AssignmentDataLogistic.EasingTighteningOnly[,10]*20,col="red")
The visualization above shows three tightening periods over the roughly 30 years covered by the data. Visually, the output variable does appear to generally increase during these tightening periods. Tightening periods are generally times when policies are put in place to curb an economy seen as growing too quickly; easing periods are the opposite, times when interest rates are lowered to stimulate the economy.
LogisticModel.TighteningEasing_3M <- glm(AssignmentDataLogistic.EasingTighteningOnly[,10]~
AssignmentDataLogistic.EasingTighteningOnly[,1],family=binomial(link=logit))
summary(LogisticModel.TighteningEasing_3M)$coefficients
## Estimate Std. Error
## (Intercept) -2.1525619 0.17328095
## AssignmentDataLogistic.EasingTighteningOnly[, 1] 0.1863761 0.02143723
## z value Pr(>|z|)
## (Intercept) -12.422380 1.975941e-35
## AssignmentDataLogistic.EasingTighteningOnly[, 1] 8.694041 3.497715e-18
matplot(AssignmentDataLogistic.EasingTighteningOnly[,-c(9,10)],type="l",ylab="Data and Fitted Values")
lines(AssignmentDataLogistic.EasingTighteningOnly[,10]*20,col="red")
lines(LogisticModel.TighteningEasing_3M$fitted.values*20,col="green")
LogisticModel.TighteningEasing_All <- glm(AssignmentDataLogistic.EasingTighteningOnly[,10]~
AssignmentDataLogistic.EasingTighteningOnly[,1:7],family=binomial(link=logit))
summary(LogisticModel.TighteningEasing_All)
##
## Call:
## glm(formula = AssignmentDataLogistic.EasingTighteningOnly[, 10] ~
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7], family = binomial(link = logit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2113 -0.8595 -0.5935 1.1306 2.5530
##
## Coefficients:
## Estimate
## (Intercept) -4.7552
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG3M -3.3456
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG6M 4.1559
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG2YR 3.9460
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG3YR -3.4642
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG5YR -3.2115
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG10YR -0.9705
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG30YR 3.3254
## Std. Error
## (Intercept) 0.4312
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG3M 0.2666
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG6M 0.3748
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG2YR 0.7554
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG3YR 0.9340
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG5YR 0.7795
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG10YR 0.9764
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG30YR 0.6138
## z value
## (Intercept) -11.029
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG3M -12.548
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG6M 11.089
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG2YR 5.224
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG3YR -3.709
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG5YR -4.120
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG10YR -0.994
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG30YR 5.418
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG3M < 2e-16 ***
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG6M < 2e-16 ***
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG2YR 1.75e-07 ***
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG3YR 0.000208 ***
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG5YR 3.79e-05 ***
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG10YR 0.320214
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG30YR 6.04e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2983.5 on 2357 degrees of freedom
## Residual deviance: 2629.6 on 2350 degrees of freedom
## AIC: 2645.6
##
## Number of Fisher Scoring iterations: 4
summary(LogisticModel.TighteningEasing_All)$coefficients[,c(1,4)]
## Estimate
## (Intercept) -4.7551928
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG3M -3.3456116
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG6M 4.1558535
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG2YR 3.9460296
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG3YR -3.4642455
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG5YR -3.2115319
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG10YR -0.9705444
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG30YR 3.3253517
## Pr(>|z|)
## (Intercept) 2.784283e-28
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG3M 4.073045e-36
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG6M 1.422964e-28
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG2YR 1.751687e-07
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG3YR 2.080617e-04
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG5YR 3.786229e-05
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG10YR 3.202140e-01
## AssignmentDataLogistic.EasingTighteningOnly[, 1:7]USGG30YR 6.036041e-08
matplot(AssignmentDataLogistic.EasingTighteningOnly[,-c(9,10)],type="l",ylab="Results of Logistic Regression")
lines(AssignmentDataLogistic.EasingTighteningOnly[,10]*20,col="red")
lines(LogisticModel.TighteningEasing_All$fitted.values*20,col="green")
We can see the coefficients of the logistic regression by using the summary function. The first coefficient is the intercept: after estimating the logistic regression with all input variables as predictors, the intercept (beta-0) estimate is -4.7552. This means that the log-odds of a day being in a tightening period when all input variables (yields to maturity) equal 0 is -4.7552. The negative value means that the probability of a day being in an easing period is higher than the probability of a day being in a tightening period, which is consistent with the tightening/easing visualization shown previously: there are more days in easing periods than in tightening periods. The remaining coefficients give the expected change in log-odds for a one-unit increase in each input variable. A positive value indicates that a one-unit increase in the input variable is associated with an increase in the odds. Input variables 2 (USGG6M), 3 (USGG2YR) and 7 (USGG30YR) have positive coefficients, meaning a day is more likely to be in a tightening period when these three variables are higher. Among those three input variables, input 2 (USGG6M) has the highest expected change in log-odds per one-unit increase, with a coefficient of 4.16. In addition, the summary statistics for the coefficients tell us that all the input variables are statistically significant except for input 6 (USGG10YR).
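To put the USGG6M slope on an odds scale (an illustrative calculation, not additional model output): a one-unit (one percentage point) increase in USGG6M, holding the other inputs fixed, multiplies the odds of a tightening day by exp(beta):

```r
# Odds-ratio interpretation of a logit coefficient, using the USGG6M
# estimate from the summary above.
beta.usgg6m <- 4.1559
odds.ratio <- exp(beta.usgg6m)   # multiplicative change in odds per unit increase
round(odds.ratio, 1)             # roughly 64
```

The size of this odds ratio should be read cautiously: with predictors this collinear, individual coefficients are not stable.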
plot(AssignmentDataLogistic.EasingTighteningOnly[,8], LogisticModel.TighteningEasing_All$fitted.values)
Based on this plot, higher fitted probabilities of a day being in a tightening period occur more often when the output variable is higher (above roughly 10). There is a cluster of fitted values between 0 and 0.6 for output values below 10. Overall, a day is more likely to be in a tightening period when the output variable is higher than usual.
plot(AssignmentDataLogistic.EasingTighteningOnly[,10], LogisticModel.TighteningEasing_All$fitted.values)
This plot is a little more challenging to analyze due to the large number of observations on both sides. Looking closely, though, the minimum and maximum fitted values for the tightening period are higher than for easing periods. In addition, fitted values tend to cluster (darker regions) between 0.0 and 0.6 for days in an easing period, while they cluster in a higher range, between 0.2 and 0.75, for days in a tightening period.
y <- LogisticModel.TighteningEasing_All$fitted.values
period <- ifelse(AssignmentDataLogistic.EasingTighteningOnly[,10]==1, "Tightening", "Easing")
fitted.values.logistic <- data.frame(period,y)
library(ggplot2)
ggplot(data = fitted.values.logistic, aes(x=y, fill=period)) +
geom_histogram()+stat_bin(bins = 20)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram above shows how the fitted values are distributed for easing periods and tightening periods. The fitted values for easing days are positively skewed, while those for tightening days look closer to a normal distribution. The mean for tightening periods occurs at around 0.375, versus around 0.300 for easing periods. Looking more closely at the right-side tails, there is a higher proportion of tightening days than easing days: above a fitted value of about 0.5, tightening days dominate, which tells me the model is doing a decent job of not classifying days in easing periods as days in a tightening period (false positives). Overall, with more refinement I would like to see the tightening-period peak shift further to the right.
Log.Odds <- predict(LogisticModel.TighteningEasing_All)
plot(Log.Odds,type="l")
Probabilities<-1/(exp(-Log.Odds)+1)
plot(LogisticModel.TighteningEasing_All$fitted.values,type="l",ylab="Fitted Values & Log-Odds")
lines(Probabilities,col="red")
Comparing the two plots of log-odds and probabilities, it is easy to see that the fitted values from the logistic regression are simply a transformation of the log-odds into the probability of the event occurring. In this instance, the event is a day being in a tightening period.
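As a sanity check (a small standalone sketch), the manual transformation used above matches R's built-in logistic CDF, `plogis()`:

```r
# 1/(exp(-x) + 1) is exactly the logistic CDF that glm's logit link inverts.
log.odds <- c(-2, 0, 2)
probs <- 1 / (exp(-log.odds) + 1)
stopifnot(isTRUE(all.equal(probs, plogis(log.odds))))
round(probs, 3)  # 0.119 0.500 0.881
```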
Below we show only two of the possible combinations: the full model containing all 7 predictors, and the null model containing only the intercept and none of the 7 predictors. Estimate other possible combinations.
AssignmentDataRegressionComparison<-AssignmentData[,-c(9,10)]
RegressionModelComparison.Full <- lm(Output1~., data = AssignmentDataRegressionComparison)
Look at coefficients, R2, adjusted R2, degrees of freedom.
summary(RegressionModelComparison.Full)$coefficients[,c(1,2,3,4)]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.9041591 1.056850e-10 -141024294891 0
## USGG3M 0.3839609 9.860401e-11 3893968285 0
## USGG6M 0.3901870 1.500111e-10 2601053702 0
## USGG2YR 0.4151851 2.569451e-10 1615851177 0
## USGG3YR 0.4063541 3.299038e-10 1231735395 0
## USGG5YR 0.3860610 2.618339e-10 1474449865 0
## USGG10YR 0.3477544 2.800269e-10 1241860763 0
## USGG30YR 0.3047124 1.566487e-10 1945195584 0
summary(RegressionModelComparison.Full)$r.squared
## [1] 1
Adjusted R-squared
summary(RegressionModelComparison.Full)$adj.r.squared
## [1] 1
summary(RegressionModelComparison.Full)$df
## [1] 8 8292 8
After fitting the linear model with all inputs, we can look at the model summary to see how well it fits. First, the coefficients and their statistical significance: the multiple regression estimates the intercept (beta-0) as -14.9, while the estimated coefficients of the input variables range from 0.304 (USGG30YR) to 0.415 (USGG2YR). According to the summary, all variables are statistically significant. One way to assess model fit is the r-squared/adjusted r-squared, which lies between 0 and 1; higher values generally indicate a better fit. Here both are exactly 1, suggesting the model fits perfectly: in other words, the inputs together explain 100% of the variation in the output variable. The most likely explanation is that the output variable was constructed as an exact linear combination of the inputs; note also the strong multicollinearity among the independent variables (see below).
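The degenerate R-squared of 1 can be reproduced in miniature (synthetic data, not the assignment set) by computing R^2 = 1 - SSE/SST by hand on an exact linear relation:

```r
# An exact linear relation leaves zero residual sum of squares, so R^2 = 1.
x <- 1:20
y <- 2 + 3 * x                      # no noise at all
fit <- lm(y ~ x)
SSE <- sum(residuals(fit)^2)
SST <- sum((y - mean(y))^2)
c(manual = 1 - SSE / SST, from.summary = summary(fit)$r.squared)
```

R even emits the same "essentially perfect fit" warning here that appears in the drop1() output further below.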
plot(AssignmentDataRegressionComparison[1:8])
cor(AssignmentDataRegressionComparison[1:7])
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR
## USGG3M 1.0000000 0.9979328 0.9843135 0.9768273 0.9603153 0.9348717
## USGG6M 0.9979328 1.0000000 0.9910013 0.9845272 0.9691905 0.9444964
## USGG2YR 0.9843135 0.9910013 1.0000000 0.9987920 0.9916366 0.9753836
## USGG3YR 0.9768273 0.9845272 0.9987920 1.0000000 0.9963859 0.9836936
## USGG5YR 0.9603153 0.9691905 0.9916366 0.9963859 1.0000000 0.9948318
## USGG10YR 0.9348717 0.9444964 0.9753836 0.9836936 0.9948318 1.0000000
## USGG30YR 0.9069437 0.9167172 0.9539191 0.9648236 0.9819750 0.9956347
## USGG30YR
## USGG3M 0.9069437
## USGG6M 0.9167172
## USGG2YR 0.9539191
## USGG3YR 0.9648236
## USGG5YR 0.9819750
## USGG10YR 0.9956347
## USGG30YR 1.0000000
Checking for multicollinearity in a dataset means checking how correlated the independent variables are with each other. The pairs plot and the correlation matrix above make clear that all input variables are highly correlated with one another.
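A common numeric diagnostic for this (a sketch on synthetic data, since the `car` package is not loaded here) is the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors:

```r
# Two nearly collinear predictors yield a very large VIF.
set.seed(3)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.05)          # almost a copy of x1
r2  <- summary(lm(x1 ~ x2))$r.squared     # R_j^2 from regressing x1 on x2
vif <- 1 / (1 - r2)
vif > 10                                   # values above ~10 flag severe collinearity
```

Applied to the seven yields above, every VIF would be enormous, matching the correlation matrix.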
RegressionModelComparison.Null <- lm(Output1~1, data = AssignmentDataRegressionComparison)
summary(RegressionModelComparison.Null)$coefficients[,c(1,2,3,4)]
## Estimate Std. Error t value Pr(>|t|)
## 1.420082e-11 9.619536e-02 1.476248e-10 1.000000e+00
summary(RegressionModelComparison.Null)$r.squared
## [1] 0
Adjusted R-squared
summary(RegressionModelComparison.Null)$adj.r.squared
## [1] 0
summary(RegressionModelComparison.Null)$df
## [1] 1 8299 1
After fitting the model with only an intercept and no input variables, we see an r-squared/adjusted r-squared of 0. An intercept-only model predicts the mean of the output for every observation, so its total squared residuals (SSE) equal the total sum of squares (SST). Therefore, when r-squared is calculated as 1 - (SSE/SST), it produces a value of 0.
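This can be verified in a few lines on synthetic data (a standalone check, independent of the assignment set):

```r
# An intercept-only lm() fits mean(y), so SSE == SST and R^2 is exactly 0.
set.seed(2)
y <- rnorm(50)
null.fit <- lm(y ~ 1)
stopifnot(isTRUE(all.equal(unname(coef(null.fit)), mean(y))))
SSE <- sum(residuals(null.fit)^2)
SST <- sum((y - mean(y))^2)
c(SSE.equals.SST = isTRUE(all.equal(SSE, SST)),
  r.squared = summary(null.fit)$r.squared)
```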
anova(RegressionModelComparison.Full,RegressionModelComparison.Null)
## Analysis of Variance Table
##
## Model 1: Output1 ~ USGG3M + USGG6M + USGG2YR + USGG3YR + USGG5YR + USGG10YR +
## USGG30YR
## Model 2: Output1 ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8292 0
## 2 8299 637400 -7 -637400 3.73e+22 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA (analysis of variance) was used to compare the full model and the null model. The comparison tells me that the change in the residual sum of squares between the two models is statistically significant.
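The F statistic anova() reports can be reproduced by hand as F = ((RSS_null - RSS_full) / q) / (RSS_full / df_full), where q is the number of restrictions; a sketch on synthetic data (the assignment comparison itself is degenerate because RSS_full is 0):

```r
# Reproduce anova()'s model-comparison F statistic manually.
set.seed(6)
x <- rnorm(60); y <- 1 + 2 * x + rnorm(60)
full <- lm(y ~ x); null <- lm(y ~ 1)
a <- anova(null, full)
F.manual <- ((sum(residuals(null)^2) - sum(residuals(full)^2)) / 1) /
  (sum(residuals(full)^2) / full$df.residual)
c(manual = F.manual, from.anova = a$F[2])
```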
drop.method <- lm(Output1~.,data=AssignmentDataRegressionComparison)
drop1(drop.method,scope= c("USGG3M","USGG6M","USGG2YR","USGG3YR","USGG5YR","USGG10YR","USGG30YR"))
## Warning: attempting model selection on an essentially perfect fit is
## nonsense
## Single term deletions
##
## Model:
## Output1 ~ USGG3M + USGG6M + USGG2YR + USGG3YR + USGG5YR + USGG10YR +
## USGG30YR
## Df Sum of Sq RSS AIC
## <none> 0.000 -336591
## USGG3M 1 37.016 37.016 -44911
## USGG6M 1 16.516 16.516 -51609
## USGG2YR 1 6.374 6.374 -59512
## USGG3YR 1 3.704 3.704 -64018
## USGG5YR 1 5.307 5.307 -61032
## USGG10YR 1 3.765 3.765 -63882
## USGG30YR 1 9.237 9.237 -56433
Since the model is already an “essentially perfect fit,” there is no need to drop any variables, suggesting that the full model is best.
add.method <- lm(Output1~1,data=AssignmentDataRegressionComparison)
add1(add.method,scope=c("USGG3M","USGG6M","USGG2YR","USGG3YR","USGG5YR","USGG10YR","USGG30YR"))
## Single term additions
##
## Model:
## Output1 ~ 1
## Df Sum of Sq RSS AIC
## <none> 637400 36033
## USGG3M 1 613692 23708 8715
## USGG6M 1 621075 16325 5618
## USGG2YR 1 635252 2148 -11217
## USGG3YR 1 636075 1325 -15226
## USGG5YR 1 632104 5296 -3725
## USGG10YR 1 617761 19639 7153
## USGG30YR 1 596181 41219 13306
comb1 <- lm(Output1 ~ USGG3YR, AssignmentDataRegressionComparison)
summary(comb1)$r.squared
## [1] 0.9979215
add1(comb1,scope=c("USGG3M","USGG6M","USGG2YR","USGG5YR","USGG10YR","USGG30YR"))
## Single term additions
##
## Model:
## Output1 ~ USGG3YR
## Df Sum of Sq RSS AIC
## <none> 1324.83 -15226
## USGG3M 1 407.98 916.85 -18279
## USGG6M 1 270.17 1054.66 -17117
## USGG2YR 1 82.93 1241.91 -15761
## USGG5YR 1 20.94 1303.89 -15356
## USGG10YR 1 64.05 1260.78 -15636
## USGG30YR 1 100.79 1224.04 -15881
comb2 <- lm(Output1 ~ USGG3YR + USGG3M, AssignmentDataRegressionComparison)
summary(comb2)$r.squared
## [1] 0.9985616
After conducting the add method, USGG3YR is the first variable selected, giving the lowest AIC and an r-squared of 0.9979. Combining USGG3YR with USGG3M produced only a slight increase in r-squared, to 0.9986. Since adding further predictors contributes little beyond USGG3YR, the best parsimonious model uses only that input.
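The manual add1() search can also be automated with stats::step(); a sketch on a synthetic frame (where x1 is, by construction, the only real predictor) rather than the assignment data:

```r
# Forward selection by AIC, automating the repeated add1() calls above.
set.seed(4)
d <- data.frame(x1 = rnorm(80), x2 = rnorm(80), x3 = rnorm(80))
d$y <- 2 * d$x1 + rnorm(80, sd = 0.2)
best <- step(lm(y ~ 1, data = d), scope = ~ x1 + x2 + x3,
             direction = "forward", trace = 0)
names(coef(best))  # x1 is selected
```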
Perform rolling window analysis of the yields data. Use package zoo for rolling window analysis.
Window.width <- 20; Window.shift <- 5
library(zoo)
## Warning: package 'zoo' was built under R version 3.4.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
all.means<-rollapply(AssignmentDataRegressionComparison,width=Window.width,by=Window.shift,by.column=TRUE, mean)
head(all.means,10)
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR Output1
## [1,] 15.0405 14.0855 13.2795 12.9360 12.7825 12.5780 12.1515 20.14842
## [2,] 15.1865 14.1440 13.4855 13.1085 12.9310 12.7370 12.3370 20.55208
## [3,] 15.2480 14.2755 13.7395 13.3390 13.1500 12.9480 12.5500 21.04895
## [4,] 14.9345 14.0780 13.7750 13.4765 13.2385 13.0515 12.6610 21.02611
## [5,] 14.7545 14.0585 13.9625 13.6890 13.4600 13.2295 12.8335 21.31356
## [6,] 14.6025 14.0115 14.0380 13.7790 13.5705 13.3050 12.8890 21.39061
## [7,] 14.0820 13.5195 13.8685 13.6710 13.4815 13.1880 12.7660 20.77200
## [8,] 13.6255 13.0635 13.6790 13.5735 13.4270 13.1260 12.6950 20.23626
## [9,] 13.2490 12.6810 13.5080 13.4575 13.3680 13.0770 12.6470 19.76988
## [10,] 12.9545 12.4225 13.4140 13.4240 13.3850 13.1115 12.6755 19.53054
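The mechanics of the rollapply() call above can be seen on a toy vector: width-20, shift-5 windows become width-3, shift-1 here, each window reduced with mean():

```r
# Miniature of the rolling-mean computation: windows (1:3), (2:4), (3:5), (4:6).
library(zoo)
out <- as.numeric(rollapply(1:6, width = 3, by = 1, FUN = mean))
out  # 2 3 4 5
```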
Count <- 1:length(AssignmentDataRegressionComparison[,1])
Rolling.window.matrix <- rollapply(Count,width=Window.width,by=Window.shift,by.column=FALSE,
FUN=function(z) z)
Rolling.window.matrix[1:10,]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 1 2 3 4 5 6 7 8 9 10 11 12 13
## [2,] 6 7 8 9 10 11 12 13 14 15 16 17 18
## [3,] 11 12 13 14 15 16 17 18 19 20 21 22 23
## [4,] 16 17 18 19 20 21 22 23 24 25 26 27 28
## [5,] 21 22 23 24 25 26 27 28 29 30 31 32 33
## [6,] 26 27 28 29 30 31 32 33 34 35 36 37 38
## [7,] 31 32 33 34 35 36 37 38 39 40 41 42 43
## [8,] 36 37 38 39 40 41 42 43 44 45 46 47 48
## [9,] 41 42 43 44 45 46 47 48 49 50 51 52 53
## [10,] 46 47 48 49 50 51 52 53 54 55 56 57 58
## [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 14 15 16 17 18 19 20
## [2,] 19 20 21 22 23 24 25
## [3,] 24 25 26 27 28 29 30
## [4,] 29 30 31 32 33 34 35
## [5,] 34 35 36 37 38 39 40
## [6,] 39 40 41 42 43 44 45
## [7,] 44 45 46 47 48 49 50
## [8,] 49 50 51 52 53 54 55
## [9,] 54 55 56 57 58 59 60
## [10,] 59 60 61 62 63 64 65
Points.of.calculation <- Rolling.window.matrix[,10]
Points.of.calculation[1:10]
## [1] 10 15 20 25 30 35 40 45 50 55
length(Points.of.calculation)
## [1] 1657
Means.forPlot <- rep(NA,length(AssignmentDataRegressionComparison[,1]))
Means.forPlot[Points.of.calculation] <- all.means[,1]
Means.forPlot[1:50]
## [1] NA NA NA NA NA NA NA NA
## [9] NA 15.0405 NA NA NA NA 15.1865 NA
## [17] NA NA NA 15.2480 NA NA NA NA
## [25] 14.9345 NA NA NA NA 14.7545 NA NA
## [33] NA NA 14.6025 NA NA NA NA 14.0820
## [41] NA NA NA NA 13.6255 NA NA NA
## [49] NA 13.2490
cbind(AssignmentDataRegressionComparison[,1],Means.forPlot)[1:50,]
## Means.forPlot
## [1,] 13.52 NA
## [2,] 13.58 NA
## [3,] 14.50 NA
## [4,] 14.76 NA
## [5,] 15.20 NA
## [6,] 15.22 NA
## [7,] 15.24 NA
## [8,] 15.08 NA
## [9,] 15.25 NA
## [10,] 15.15 15.0405
## [11,] 15.79 NA
## [12,] 15.32 NA
## [13,] 15.71 NA
## [14,] 15.72 NA
## [15,] 15.70 15.1865
## [16,] 15.35 NA
## [17,] 15.20 NA
## [18,] 15.06 NA
## [19,] 14.87 NA
## [20,] 14.59 15.2480
## [21,] 14.90 NA
## [22,] 14.85 NA
## [23,] 14.67 NA
## [24,] 14.74 NA
## [25,] 15.32 14.9345
## [26,] 15.52 NA
## [27,] 15.46 NA
## [28,] 15.54 NA
## [29,] 15.51 NA
## [30,] 15.14 14.7545
## [31,] 15.02 NA
## [32,] 14.48 NA
## [33,] 14.09 NA
## [34,] 14.23 NA
## [35,] 14.15 14.6025
## [36,] 14.20 NA
## [37,] 14.14 NA
## [38,] 14.22 NA
## [39,] 14.52 NA
## [40,] 14.39 14.0820
## [41,] 14.49 NA
## [42,] 14.51 NA
## [43,] 14.29 NA
## [44,] 14.16 NA
## [45,] 13.99 13.6255
## [46,] 13.92 NA
## [47,] 13.66 NA
## [48,] 13.21 NA
## [49,] 13.02 NA
## [50,] 12.95 13.2490
plot(Means.forPlot,col="red")
lines(AssignmentDataRegressionComparison[,1])
all.dailydiff <- rollapply(AssignmentDataRegressionComparison,width=2,by=1,by.column=TRUE, diff)
head(all.dailydiff,10)
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR Output1
## [1,] 0.06 0.07 0.14 0.03 -0.08 -0.04 0.00 0.07587222
## [2,] 0.92 0.74 0.50 0.47 0.40 0.27 0.22 1.35591615
## [3,] 0.26 0.10 0.17 0.17 0.07 -0.03 0.02 0.30119608
## [4,] 0.44 0.30 0.44 0.33 0.20 0.22 0.22 0.82353207
## [5,] 0.02 -0.07 -0.36 -0.34 -0.17 -0.12 -0.05 -0.42985741
## [6,] 0.02 -0.13 0.13 0.03 -0.03 0.08 0.00 0.03935812
## [7,] -0.16 -0.20 -0.35 -0.22 -0.07 0.00 -0.01 -0.40425521
## [8,] 0.17 0.19 0.30 0.27 0.16 0.09 0.18 0.52159590
## [9,] -0.10 -0.11 -0.17 -0.17 -0.11 -0.09 -0.12 -0.33130842
## [10,] 0.64 0.75 0.54 0.07 0.26 0.11 0.12 0.96621424
rolling.sd <- rollapply(all.dailydiff,width=Window.width,by=Window.shift,by.column=TRUE, sd)
head(rolling.sd)
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR
## [1,] 0.3433258 0.3262462 0.2748258 0.2030161 0.1713192 0.1299585 0.1202147
## [2,] 0.2933383 0.2907504 0.2261811 0.1499219 0.1450082 0.1146895 0.1192201
## [3,] 0.2613180 0.2437530 0.2006201 0.1632596 0.1654110 0.1459308 0.1351909
## [4,] 0.2551754 0.2469663 0.1989446 0.1692794 0.1717219 0.1551052 0.1422183
## [5,] 0.2480551 0.2481595 0.2102004 0.1786057 0.1744767 0.1643960 0.1516540
## [6,] 0.1963884 0.2363672 0.2095082 0.1809180 0.1822917 0.1664956 0.1537351
## Output1
## [1,] 0.5639875
## [2,] 0.4707427
## [3,] 0.4681168
## [4,] 0.4786189
## [5,] 0.4888569
## [6,] 0.4788897
rolling.dates <- rollapply(AssignmentDataRegressionComparison[-1,],width=Window.width,by=Window.shift,
by.column=FALSE,FUN=function(z) rownames(z))
rownames(rolling.sd) <- rolling.dates[,10]
matplot(rolling.sd[,c(1,5,7,8)],xaxt="n",type="l",col=c("black","red","blue","green"))
axis(side=1,at=1:1656,rownames(rolling.sd))
high.volatility.periods <- rownames(rolling.sd)[rolling.sd[,8]>.5]
high.volatility.periods
## [1] "1/19/1981" "3/25/1981" "4/1/1981" "4/8/1981" "4/23/1981"
## [6] "4/30/1981" "5/7/1981" "5/14/1981" "5/21/1981" "5/29/1981"
## [11] "6/5/1981" "6/12/1981" "6/19/1981" "6/26/1981" "7/6/1981"
## [16] "7/13/1981" "7/20/1981" "7/27/1981" "10/28/1981" "11/5/1981"
## [21] "11/13/1981" "11/20/1981" "11/30/1981" "12/7/1981" "12/14/1981"
## [26] "12/29/1981" "1/14/1982" "1/21/1982" "1/28/1982" "2/4/1982"
## [31] "2/11/1982" "7/29/1982" "8/5/1982" "8/12/1982" "8/19/1982"
## [36] "8/26/1982" "9/24/1982" "10/1/1982" "10/8/1982" "10/18/1982"
## [41] "10/13/1987" "10/20/1987" "10/27/1987" "11/19/2007" "11/26/2007"
## [46] "11/12/2008" "11/19/2008"
Coefficients <- rollapply(AssignmentDataRegressionComparison,width=Window.width,by=Window.shift,by.column=FALSE,
FUN=function(z) coef(lm(Output1~USGG3M+USGG5YR+USGG30YR,data=as.data.frame(z))))
rolling.dates <- rollapply(AssignmentDataRegressionComparison[,1:8],width=Window.width,by=Window.shift,by.column=FALSE,
FUN=function(z) rownames(z))
rownames(Coefficients) <- rolling.dates[,10]
Coefficients[1:10,]
## (Intercept) USGG3M USGG5YR USGG30YR
## 1/16/1981 -15.70877 0.6723609 1.797680 0.2276011
## 1/23/1981 -15.96684 0.6948992 1.480514 0.5529139
## 1/30/1981 -16.77273 0.7078197 1.434388 0.6507280
## 2/6/1981 -16.90734 0.7279033 1.470083 0.6003377
## 2/17/1981 -17.46896 0.7343406 1.361289 0.7499705
## 2/24/1981 -17.04722 0.7357663 1.295641 0.7844908
## 3/3/1981 -17.67901 0.8544681 1.396653 0.5945022
## 3/10/1981 -17.72402 0.9162385 1.654274 0.2571200
## 3/17/1981 -17.00260 0.9265767 1.647852 0.1951273
## 3/24/1981 -16.84302 0.9102780 1.477727 0.3788401
pairs(Coefficients)
The pairs plot above tells a different story from the earlier analysis, which treated the data as a whole without accounting for periods of time. Plotting the rolling-window coefficients of the linear model with three predictors (USGG3M, USGG5YR, and USGG30YR) against each other reveals multicollinearity among the input variables. The USGG3M coefficients show no meaningful relationship with either the USGG5YR or the USGG30YR coefficients. However, the pairs plot does show a strong negative relationship between the USGG5YR and USGG30YR coefficients.
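A minimal sketch on synthetic data (not the assignment series) of why nearly collinear predictors produce negatively related coefficient estimates across repeated fits, as observed in the pairs plot:

```r
# Simulate 200 "windows": in each, x2 is nearly collinear with x1.
# The fitted coefficients on x1 and x2 then trade off against each other.
set.seed(6)
coefs <- t(replicate(200, {
  x1 <- rnorm(50)
  x2 <- x1 + rnorm(50, sd = 0.1)   # almost the same signal as x1
  y  <- x1 + x2 + rnorm(50)
  coef(lm(y ~ x1 + x2))[2:3]       # keep the two slope estimates
}))
cor(coefs[, 1], coefs[, 2])        # strongly negative
```

The sum of the two slopes is pinned down by the data, but how it splits between the collinear predictors is not, so the estimates move in opposite directions from window to window.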
matplot(Coefficients[,-1],xaxt="n",type="l",col=c("black","red","green"))
axis(side=1,at=1:1657,rownames(Coefficients))
high.slopespread.periods <- rownames(Coefficients)[Coefficients[,3]-Coefficients[,4]>3]
jump.slopes <- rownames(Coefficients)[Coefficients[,3]>3]
high.slopespread.periods
## [1] "4/26/1982" "12/15/1982" "9/16/1985" "5/12/1987" "5/19/1987"
## [6] "5/27/1987" "9/25/1987" "3/15/1988" "9/27/1988" "10/5/1988"
## [11] "3/10/1989" "2/5/1992" "8/3/1994" "12/8/1994" "6/14/1996"
## [16] "5/9/1997" "5/16/1997" "5/23/1997" "5/30/1997" "12/26/2000"
## [21] "1/2/2001" "7/25/2001" "8/1/2001" "11/13/2003" "8/12/2004"
## [26] "12/16/2004"
The dates shown above are days on which the estimated USGG5YR coefficient exceeds the estimated USGG30YR coefficient by more than 3. There are 26 such days, the most recent being 12/16/2004.
jump.slopes
## [1] "12/16/2004"
The date shown above is identified as a ‘jump slope’, which occurs when the estimated USGG5YR coefficient is greater than 3. There was one such date, 12/16/2004.
Comparing the picture of coefficients with the picture of pairs, we can see that it is consistent with the observation that the USGG5YR and USGG30YR coefficients have a negative relationship. In the picture of coefficients, the USGG5YR coefficients are the red line and the USGG30YR coefficients are the green line. The red line tends to sit in the upper part of the graph while the green line sits in the lower part, and when one rises the other tends to fall. In other words, there is a negative relationship between the two series, which validates the relationship observed in the picture of pairs.
# R-squared
r.squared <- rollapply(AssignmentDataRegressionComparison,width=Window.width,by=Window.shift,by.column=FALSE,
FUN=function(z) summary(lm(Output1~USGG3M+USGG5YR+USGG30YR,data=as.data.frame(z)))$r.squared)
r.squared <- cbind(rolling.dates[,10],r.squared)
r.squared[1:10,]
## r.squared
## [1,] "1/16/1981" "0.995046300986446"
## [2,] "1/23/1981" "0.992485868136646"
## [3,] "1/30/1981" "0.998641209587999"
## [4,] "2/6/1981" "0.998849080081881"
## [5,] "2/17/1981" "0.997958757207598"
## [6,] "2/24/1981" "0.996489757136839"
## [7,] "3/3/1981" "0.99779753570421"
## [8,] "3/10/1981" "0.998963395226792"
## [9,] "3/17/1981" "0.998729445388789"
## [10,] "3/24/1981" "0.997073000898673"
plot(r.squared[,2],xaxt="n",ylim=c(0,1))
axis(side=1,at=1:1657,rownames(Coefficients))
#### Analysis

The plot above shows the R-squared values for the linear model just applied to the rolling-window data, using USGG3M, USGG5YR, and USGG30YR as predictors. The R-squared values tend to be very high, clustering close to 1. Only a handful of days fall short of that level and could be considered outliers.
(low.r.squared.periods<-r.squared[r.squared[,2]<.9,1])
## [1] "6/24/1987" "6/27/1991" "4/28/2005" "6/20/2012"
The dates above are the rare days that had R-squared values below 0.90. A lower R-squared means the sum of squared errors is higher: the model is not able to fully explain the variation that is occurring. Some external factor not captured by the three yields may have made the output harder to fit on those particular days or periods.
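As a quick check of the relationship just described, here is a minimal sketch on synthetic data (not the assignment set) verifying that R-squared is 1 minus the ratio of the sum of squared errors to the total sum of squares:

```r
# Toy regression: compute R-squared by hand and compare with summary().
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)
SSE <- sum(residuals(fit)^2)     # sum of squared errors
SST <- sum((y - mean(y))^2)      # total sum of squares
r2.manual <- 1 - SSE / SST
all.equal(r2.manual, summary(fit)$r.squared)   # identical
```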
# P-values
Pvalues <- rollapply(AssignmentDataRegressionComparison,width=Window.width,by=Window.shift,by.column=FALSE,
FUN=function(z) summary(lm(Output1~USGG3M+USGG5YR+USGG30YR,data=as.data.frame(z)))$coefficients[,4])
rownames(Pvalues) <- rolling.dates[,10]
Pvalues[1:10,]
## (Intercept) USGG3M USGG5YR USGG30YR
## 1/16/1981 1.193499e-10 3.764585e-10 2.391260e-07 2.538852e-01
## 1/23/1981 3.751077e-12 1.008053e-11 2.447369e-07 5.300949e-03
## 1/30/1981 3.106359e-18 1.406387e-14 4.040035e-09 3.626961e-05
## 2/6/1981 2.591522e-19 3.360104e-19 3.828054e-11 2.221691e-05
## 2/17/1981 1.897239e-16 6.578118e-17 1.461743e-09 1.331767e-04
## 2/24/1981 2.341158e-13 1.000212e-13 9.008221e-07 4.733543e-03
## 3/3/1981 5.435581e-14 1.535503e-11 3.357199e-06 6.010473e-02
## 3/10/1981 6.227624e-16 1.178498e-16 1.679479e-05 3.851840e-01
## 3/17/1981 9.592582e-17 7.065226e-20 1.459692e-05 5.025726e-01
## 3/24/1981 8.248747e-16 6.689840e-16 6.413371e-04 3.052705e-01
matplot(Pvalues,xaxt="n",col=c("black","blue","red","green"),type="o")
axis(side=1,at=1:1657,rownames(Coefficients))
rownames(Pvalues)[Pvalues[,2]>.5]
## [1] "7/15/1992" "7/26/1996" "8/2/1996" "11/7/2000" "5/30/2001"
## [6] "5/2/2002" "5/16/2002" "5/23/2002" "1/30/2003" "2/6/2003"
## [11] "7/24/2003" "7/31/2003" "8/7/2003" "11/20/2003" "12/18/2003"
## [16] "4/28/2005" "2/10/2006" "3/9/2007" "3/16/2007" "7/21/2009"
## [21] "10/6/2009" "10/13/2009" "12/28/2010" "1/11/2011" "3/1/2011"
## [26] "11/16/2011" "11/23/2011" "5/23/2012" "7/11/2012" "6/6/2013"
## [31] "1/16/2014" "1/30/2014" "3/6/2014"
rownames(Pvalues)[Pvalues[,3]>.5]
## [1] "12/1/1982" "3/16/1987" "4/28/1987" "6/24/1987" "9/3/1987" "9/11/1987"
## [7] "9/20/1988" "12/3/1999"
rownames(Pvalues)[Pvalues[,4]>.5]
## [1] "3/17/1981" "4/22/1981" "4/29/1981" "6/4/1981" "10/13/1981"
## [6] "11/19/1981" "2/3/1982" "2/26/1982" "4/2/1982" "4/12/1982"
## [11] "5/3/1982" "7/7/1982" "9/1/1982" "9/23/1982" "12/8/1982"
## [16] "2/18/1983" "3/1/1983" "5/4/1983" "12/30/1983" "1/9/1984"
## [21] "2/6/1984" "3/14/1984" "3/21/1984" "4/11/1984" "6/22/1984"
## [26] "6/29/1984" "10/31/1984" "11/16/1984" "11/26/1984" "12/17/1984"
## [31] "3/25/1985" "5/14/1985" "5/21/1985" "8/30/1985" "9/9/1985"
## [36] "9/23/1985" "10/1/1985" "10/8/1985" "10/16/1985" "10/23/1985"
## [41] "10/30/1985" "11/14/1985" "11/21/1985" "1/22/1986" "1/29/1986"
## [46] "5/5/1987" "12/16/1987" "1/25/1988" "2/1/1988" "2/16/1988"
## [51] "3/1/1988" "3/22/1988" "5/18/1988" "6/3/1988" "6/10/1988"
## [56] "6/24/1988" "7/25/1988" "8/15/1988" "12/5/1988" "2/2/1989"
## [61] "3/3/1989" "4/10/1989" "5/1/1989" "6/13/1989" "8/16/1989"
## [66] "9/14/1989" "9/21/1989" "10/3/1989" "10/11/1989" "10/18/1989"
## [71] "11/1/1989" "11/30/1989" "12/7/1989" "1/8/1990" "1/16/1990"
## [76] "6/15/1990" "7/30/1990" "8/6/1990" "10/2/1990" "10/10/1990"
## [81] "11/23/1990" "3/1/1991" "5/13/1991" "5/20/1991" "6/13/1991"
## [86] "7/5/1991" "7/19/1991" "9/16/1991" "2/12/1992" "2/20/1992"
## [91] "3/12/1992" "4/16/1992" "4/24/1992" "5/1/1992" "5/8/1992"
## [96] "6/2/1992" "6/9/1992" "6/16/1992" "8/19/1992" "8/26/1992"
## [101] "10/23/1992" "5/20/1993" "6/11/1993" "6/18/1993" "8/30/1993"
## [106] "12/15/1993" "12/22/1993" "3/16/1994" "3/30/1994" "4/6/1994"
## [111] "4/20/1994" "4/27/1994" "6/29/1994" "8/17/1994" "9/21/1994"
## [116] "9/28/1994" "12/22/1994" "12/29/1994" "1/5/1995" "1/12/1995"
## [121] "1/26/1995" "2/2/1995" "2/9/1995" "4/6/1995" "4/13/1995"
## [126] "8/25/1995" "9/29/1995" "10/27/1995" "11/3/1995" "11/10/1995"
## [131] "12/29/1995" "1/5/1996" "1/12/1996" "1/19/1996" "3/29/1996"
## [136] "5/31/1996" "6/21/1996" "7/12/1996" "7/19/1996" "7/26/1996"
## [141] "8/2/1996" "8/9/1996" "8/16/1996" "8/23/1996" "9/27/1996"
## [146] "10/4/1996" "12/6/1996" "2/28/1997" "3/7/1997" "4/18/1997"
## [151] "4/25/1997" "5/2/1997" "6/13/1997" "6/20/1997" "6/27/1997"
## [156] "7/4/1997" "10/10/1997" "10/17/1997" "12/12/1997" "12/19/1997"
## [161] "12/26/1997" "1/9/1998" "1/16/1998" "8/14/1998" "8/21/1998"
## [166] "8/28/1998" "9/18/1998" "9/25/1998" "12/4/1998" "12/11/1998"
## [171] "1/8/1999" "3/12/1999" "4/2/1999" "5/14/1999" "6/25/1999"
## [176] "7/9/1999" "7/30/1999" "8/20/1999" "9/10/1999" "9/24/1999"
## [181] "10/15/1999" "12/31/1999" "3/3/2000" "3/31/2000" "4/7/2000"
## [186] "4/14/2000" "4/21/2000" "7/25/2000" "11/21/2000" "11/28/2000"
## [191] "3/7/2001" "5/30/2001" "7/11/2001" "10/11/2001" "12/13/2001"
## [196] "1/24/2002" "1/31/2002" "8/29/2002" "9/26/2002" "12/26/2002"
## [201] "1/16/2003" "1/23/2003" "3/6/2003" "3/13/2003" "3/20/2003"
## [206] "3/27/2003" "5/1/2003" "6/19/2003" "6/26/2003" "7/3/2003"
## [211] "9/18/2003" "9/25/2003" "10/16/2003" "10/23/2003" "10/30/2003"
## [216] "11/20/2003" "1/1/2004" "2/5/2004" "2/26/2004" "3/4/2004"
## [221] "4/15/2004" "4/22/2004" "5/13/2004" "5/27/2004" "6/3/2004"
## [226] "6/17/2004" "6/24/2004" "7/8/2004" "10/14/2004" "10/21/2004"
## [231] "10/28/2004" "11/18/2004" "12/23/2004" "3/17/2005" "3/24/2005"
## [236] "3/31/2005" "4/7/2005" "7/21/2005" "8/11/2005" "9/22/2005"
## [241] "10/6/2005" "11/3/2005" "12/8/2005" "12/15/2005" "1/20/2006"
## [246] "5/12/2006" "5/19/2006" "5/26/2006" "6/2/2006" "6/9/2006"
## [251] "6/16/2006" "7/7/2006" "7/21/2006" "10/6/2006" "11/3/2006"
## [256] "1/12/2007" "2/23/2007" "3/16/2007" "4/27/2007" "5/25/2007"
## [261] "9/7/2007" "9/14/2007" "9/21/2007" "9/28/2007" "11/16/2007"
## [266] "5/5/2009" "6/1/2010" "6/15/2010" "1/25/2011" "2/1/2011"
## [271] "6/12/2014"
The plot of rolling p-values allows us to compare the significance of the input variables. Based on the plot, USGG30YR (green line) appears to be the least significant of the three, with 271 days on which its p-value exceeded 0.50. USGG3M (blue line) is the next least significant, with 33 such days. The most significant predictor is USGG5YR, with only 8 days on which its p-value exceeded 0.50. Visually, the red line tends to sit at the base of the graph, which confirms that USGG5YR is the most consistently significant of the three predictors.
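For reference, a minimal sketch on toy data (not the assignment series) of how the per-window p-values above are extracted: column 4 of `summary()$coefficients` holds Pr(>|t|) for each term.

```r
# Fit a small model and pull the p-value column from the coefficient table.
set.seed(7)
x <- rnorm(60)
y <- 0.5 * x + rnorm(60)
fit <- lm(y ~ x)
pvals <- summary(fit)$coefficients[, 4]   # Pr(>|t|) for intercept and slope
pvals
```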
AssignmentData.Output <- AssignmentData$Output1
AssignmentData <- data.matrix(AssignmentData[,1:7],rownames.force="automatic")
dim(AssignmentData)
## [1] 8300 7
head(AssignmentData)
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR
## 1/5/1981 13.52 13.09 12.289 12.28 12.294 12.152 11.672
## 1/6/1981 13.58 13.16 12.429 12.31 12.214 12.112 11.672
## 1/7/1981 14.50 13.90 12.929 12.78 12.614 12.382 11.892
## 1/8/1981 14.76 14.00 13.099 12.95 12.684 12.352 11.912
## 1/9/1981 15.20 14.30 13.539 13.28 12.884 12.572 12.132
## 1/12/1981 15.22 14.23 13.179 12.94 12.714 12.452 12.082
# Select 3 variables. Explore dimensionality and correlation
AssignmentData.3M_2Y_5Y<-AssignmentData[,c(1,3,5)]
pairs(AssignmentData.3M_2Y_5Y)
Observe the 3D plot of the set. Use library rgl:
library("rgl")
## Warning: package 'rgl' was built under R version 3.4.4
rgl.points(AssignmentData.3M_2Y_5Y)
Covariance.Matrix <- cov(AssignmentData)
#Means
a <- mean(AssignmentData[,1])
b <- mean(AssignmentData[,2])
c <- mean(AssignmentData[,3])
d <- mean(AssignmentData[,4])
e <- mean(AssignmentData[,5])
f <- mean(AssignmentData[,6])
g <- mean(AssignmentData[,7])
k <- ncol(AssignmentData)
n <- nrow(AssignmentData)
Input.means <- c(a, b, c, d, e, f, g)
Input.means.matrix <- matrix(data = Input.means, nrow = n, byrow = FALSE)
## Warning in matrix(data = Input.means, nrow = n, byrow = FALSE): data length
## [7] is not a sub-multiple or multiple of the number of rows [8300]
Input.means.matrix <- matrix(data=1, nrow=n) %*% cbind(a,b,c,d,e,f,g)
#centering means
diff.matrix <- AssignmentData - Input.means.matrix
Manual.Covariance.Matrix <- (t(diff.matrix)%*% diff.matrix)*((n-1)^-1)
Manual.Covariance.Matrix
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR
## USGG3M 11.760393 11.855287 12.303031 11.942035 11.188856 9.924865
## USGG6M 11.855287 12.000510 12.512434 12.158422 11.406959 10.128890
## USGG2YR 12.303031 12.512434 13.284203 12.977542 12.279514 11.005377
## USGG3YR 11.942035 12.158422 12.977542 12.708647 12.068078 10.856033
## USGG5YR 11.188856 11.406959 12.279514 12.068078 11.543082 10.463386
## USGG10YR 9.924865 10.128890 11.005377 10.856033 10.463386 9.583483
## USGG30YR 8.587987 8.768702 9.600181 9.497246 9.212159 8.510632
## USGG30YR
## USGG3M 8.587987
## USGG6M 8.768702
## USGG2YR 9.600181
## USGG3YR 9.497246
## USGG5YR 9.212159
## USGG10YR 8.510632
## USGG30YR 7.624304
Covariance.Matrix
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR
## USGG3M 11.760393 11.855287 12.303031 11.942035 11.188856 9.924865
## USGG6M 11.855287 12.000510 12.512434 12.158422 11.406959 10.128890
## USGG2YR 12.303031 12.512434 13.284203 12.977542 12.279514 11.005377
## USGG3YR 11.942035 12.158422 12.977542 12.708647 12.068078 10.856033
## USGG5YR 11.188856 11.406959 12.279514 12.068078 11.543082 10.463386
## USGG10YR 9.924865 10.128890 11.005377 10.856033 10.463386 9.583483
## USGG30YR 8.587987 8.768702 9.600181 9.497246 9.212159 8.510632
## USGG30YR
## USGG3M 8.587987
## USGG6M 8.768702
## USGG2YR 9.600181
## USGG3YR 9.497246
## USGG5YR 9.212159
## USGG10YR 8.510632
## USGG30YR 7.624304
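The element-by-element construction above can be written more compactly; a sketch on synthetic data (not the assignment matrix), assuming only base R, showing that centering with `colMeans()` and using `crossprod()` reproduces `cov()` exactly:

```r
# Manual covariance: center the columns, then t(D) %*% D / (n - 1).
set.seed(2)
X <- matrix(rnorm(50 * 3), ncol = 3)
D <- sweep(X, 2, colMeans(X))               # subtract each column mean
manual.cov <- crossprod(D) / (nrow(X) - 1)  # crossprod(D) == t(D) %*% D
all.equal(manual.cov, cov(X), check.attributes = FALSE)
```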
Maturities<-c(.25,.5,2,3,5,10,30)
contour(Maturities,Maturities,Covariance.Matrix)
Eigen.Decomposition.values <- eigen(Manual.Covariance.Matrix)$values
Eigen.Decomposition.vectors <- eigen(Manual.Covariance.Matrix)$vectors
Means <- apply(AssignmentData, 2, function(x) mean(x))
Principle.Components <- diff.matrix %*% Eigen.Decomposition.vectors
Loadings <- Eigen.Decomposition.vectors[,1:3]
Factors <- Principle.Components[,1:3]
barplot(Eigen.Decomposition.values/sum(Eigen.Decomposition.values),width=2,ylim = c(0,1),col = "black",
names.arg=c("F1","F2","F3","F4","F5","F6","F7"))
Eigen.Decomposition.values/sum(Eigen.Decomposition.values)
## [1] 9.783429e-01 1.976344e-02 1.558895e-03 1.803130e-04 1.059986e-04
## [6] 2.864693e-05 1.981838e-05
Maturities <- c(.25,.5,2,3,5,10,30)
matplot(Maturities, Loadings,type="l",lty=1,col=c("black","red","green"),lwd=3)
Based on the bar graph showing the breakdown of variance across the seven components, Factor 1 holds by far the largest share, accounting for 97.8% of the variation in the data. The next largest component is Factor 2, at about 1.98%; combined, the first two components explain over 99% of the variation. The second graph plots the first three factor loadings against maturity. Each loading lies between -1.00 and +1.00 (the loading vectors are unit-length eigenvectors) and describes how strongly that maturity contributes to the factor. For example, the first three factors have loadings of -0.38 (black line), -0.51 (red line), and +0.53 (green line) on the USGG3M variable. The larger a loading's magnitude, the more heavily that factor weights the corresponding input variable.
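The variance proportions computed above from the eigenvalues agree with what `princomp()` reports; a sketch on synthetic data (not the yield set) verifying this, noting that `princomp()` uses divisor n while `cov()` uses n-1, which cancels in the proportions:

```r
# Proportion of variance per component: eigen(cov()) vs princomp().
set.seed(3)
X <- matrix(rnorm(200 * 4), ncol = 4)
ev <- eigen(cov(X))$values
prop.eigen <- ev / sum(ev)
pc <- princomp(X)
prop.princomp <- pc$sdev^2 / sum(pc$sdev^2)  # the n vs n-1 divisor cancels
all.equal(prop.eigen, unname(prop.princomp))
```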
Factors <- Principle.Components[,1:3]
matplot(Factors,type="l",col=c("black","red","green"),lty=1,lwd=3)
Loadings[,1] <- -Loadings[,1]
Factors[,1] <- -Factors[,1]
matplot(Factors,type="l",col=c("black","red","green"),lty=1,lwd=3)
matplot(Maturities,Loadings,type="l",lty=1,col=c("black","red","green"),lwd=3)
plot(Factors[,1],Factors[,2],type="l",lwd=2)
After analyzing the plot above, I can draw the following conclusions about the first two factors and their interaction over time. It is important to note that Factor 2 was generally stable throughout the sample, while Factor 1 trended strongly over time as overall yield levels changed. First, the dense, positively sloped region on the left side of the plot suggests a positive relationship between Factor 1 and Factor 2; this would have occurred some time in the 1980s, when Factor 1 moved most rapidly. Second, the condensed, clumped middle part of the plot suggests a period in which both factors were relatively stable with little volatility, likely the late 1980s and early 1990s. Finally, the right side of the graph reflects a period of high volatility, where a rise in Factor 1 tended to come with a fall in Factor 2 (and vice versa).
OldCurve <- AssignmentData[135,]
NewCurve <- AssignmentData[136,]
CurveChange <- NewCurve-OldCurve
FactorsChange <- Factors[136,]-Factors[135,]
ModelCurveAdjustment.1Factor <- OldCurve+t(Loadings[,1])*FactorsChange[1]
ModelCurveAdjustment.2Factors <- OldCurve+t(Loadings[,1])*FactorsChange[1]+t(Loadings[,2])*FactorsChange[2]
ModelCurveAdjustment.3Factors <- OldCurve+t(Loadings[,1])*FactorsChange[1]+t(Loadings[,2])*FactorsChange[2]+
t(Loadings[,3])*FactorsChange[3]
matplot(Maturities,
t(rbind(OldCurve,NewCurve,ModelCurveAdjustment.1Factor,ModelCurveAdjustment.2Factors,
ModelCurveAdjustment.3Factors)),
type="l",lty=c(1,1,2,2,2),col=c("black","red","green","blue","magenta"),lwd=3,ylab="Curve Adjustment")
legend(x="topright",c("Old Curve","New Curve","1-Factor Adj.","2-Factor Adj.",
"3-Factor Adj."),lty=c(1,1,2,2,2),lwd=3,col=c("black","red","green","blue","magenta"))
rbind(CurveChange,ModelCurveAdjustment.3Factors-OldCurve)
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR
## CurveChange 1.070000 1.070000 0.8900000 0.8300000 0.7200000 0.5000000
## 1.090063 1.041267 0.9046108 0.8248257 0.6979317 0.5531734
## USGG30YR
## CurveChange 0.4700000
## 0.4357793
The line graph above shows how the loadings drive the curve adjustments: the more factors we add, the closer the adjusted curve gets to the actual new curve. More specifically, it took both Factor 1 and Factor 2 for the adjusted curve to fit closely; adding Factor 3 produced only a small additional change.
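The same factor-reconstruction logic can be checked on synthetic data (a sketch, not the assignment matrix): projecting centered data onto the leading eigenvectors gives an approximation that improves with each factor, and using all factors recovers the data exactly.

```r
# Reconstruct centered data from principal-component scores and loadings.
set.seed(4)
X <- matrix(rnorm(100 * 5), ncol = 5)
D <- sweep(X, 2, colMeans(X))              # centered data
E <- eigen(cov(X))$vectors                 # orthonormal loadings
scores <- D %*% E
recon3 <- scores[, 1:3] %*% t(E[, 1:3])    # 3-factor approximation
recon5 <- scores %*% t(E)                  # all 5 factors: exact recovery
all.equal(recon5, D)
```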
# How close is the approximation for each maturity?
# 10Y
cbind(Maturities,Loadings)
## Maturities
## [1,] 0.25 0.3839609 -0.50744508 0.5298222
## [2,] 0.50 0.3901870 -0.43946144 0.1114737
## [3,] 2.00 0.4151851 -0.11112721 -0.4187873
## [4,] 3.00 0.4063541 0.01696988 -0.4476561
## [5,] 5.00 0.3860610 0.23140317 -0.2462364
## [6,] 10.00 0.3477544 0.43245979 0.1500903
## [7,] 30.00 0.3047124 0.54421228 0.4979195
Model.10Y<-Loadings[6,1]*Factors[,1]+Loadings[6,2]*Factors[,2]+Loadings[6,3]*Factors[,3]
matplot(cbind(AssignmentData[,6],Model.10Y),type="l",lty=1,lwd=c(3,1),col=c("black","red"),ylab="10Y Yield")
# Do PCA analysis using princomp()
PCA.Yields<-princomp(AssignmentData)
names(PCA.Yields)
## [1] "sdev" "loadings" "center" "scale" "n.obs" "scores"
## [7] "call"
# Check that the loadings are the same
cbind(PCA.Yields$loadings[,1:3],Maturities,Eigen.Decomposition.vectors[,1:3])
## Comp.1 Comp.2 Comp.3 Maturities
## USGG3M -0.3839609 0.50744508 0.5298222 0.25 -0.3839609
## USGG6M -0.3901870 0.43946144 0.1114737 0.50 -0.3901870
## USGG2YR -0.4151851 0.11112721 -0.4187873 2.00 -0.4151851
## USGG3YR -0.4063541 -0.01696988 -0.4476561 3.00 -0.4063541
## USGG5YR -0.3860610 -0.23140317 -0.2462364 5.00 -0.3860610
## USGG10YR -0.3477544 -0.43245979 0.1500903 10.00 -0.3477544
## USGG30YR -0.3047124 -0.54421228 0.4979195 30.00 -0.3047124
##
## USGG3M -0.50744508 0.5298222
## USGG6M -0.43946144 0.1114737
## USGG2YR -0.11112721 -0.4187873
## USGG3YR 0.01696988 -0.4476561
## USGG5YR 0.23140317 -0.2462364
## USGG10YR 0.43245979 0.1500903
## USGG30YR 0.54421228 0.4979195
matplot(Maturities,PCA.Yields$loadings[,1:3],type="l",col=c("black","red","green"),lty=1,lwd=3)
matplot(PCA.Yields$scores[,1:3],type="l",col=c("black","red","green"),lwd=3,lty=1)
PCA.Yields$loadings[,1]<--PCA.Yields$loadings[,1]
PCA.Yields$scores[,1]<--PCA.Yields$scores[,1]
matplot(Maturities,PCA.Yields$loadings[,1:3],type="l",col=c("black","red","green"),lty=1,lwd=3)
matplot(PCA.Yields$scores[,1:3],type="l",col=c("black","red","green"),lwd=3,lty=1)
# What variable did we have as Output?
matplot(cbind(PCA.Yields$scores[,1],AssignmentData.Output,Factors[,1]),type="l",col=c("black","red","green"),lwd=c(3,2,1),lty=c(1,2,3),ylab="Factor 1")
Compare the regression coefficients from Step 2 and Step 3 with the factor loadings. First, look at the slopes of the models regressing each yield column on AssignmentData.Output.
t(apply(AssignmentData, 2, function(AssignmentData.col) lm(AssignmentData.col~AssignmentData.Output)$coef))
## (Intercept) AssignmentData.Output
## USGG3M 4.675134 0.3839609
## USGG6M 4.844370 0.3901870
## USGG2YR 5.438888 0.4151851
## USGG3YR 5.644458 0.4063541
## USGG5YR 6.009421 0.3860610
## USGG10YR 6.481316 0.3477544
## USGG30YR 6.869355 0.3047124
cbind(PCA.Yields$center,PCA.Yields$loadings[,1])
## [,1] [,2]
## USGG3M 4.675134 0.3839609
## USGG6M 4.844370 0.3901870
## USGG2YR 5.438888 0.4151851
## USGG3YR 5.644458 0.4063541
## USGG5YR 6.009421 0.3860610
## USGG10YR 6.481316 0.3477544
## USGG30YR 6.869355 0.3047124
This shows that the zeroth loading (the vector of means, PCA.Yields$center) equals the vector of intercepts of the models Y~Output1, where Y is one of the columns of yields in the data, and the slopes of the same models equal the first loading.
Check whether the same is true in the opposite direction: is there a correspondence between the coefficients of the models Output1~Yield and the first loading?
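The slope-equals-loading identity holds for any data set; a sketch on synthetic data (not the yield matrix): regressing each centered column on the first principal-component scores recovers that column's entry in the first loading, because cov(D, F1) = lambda1 * E1 and var(F1) = lambda1.

```r
# Slopes of column ~ PC1 score equal the first eigenvector's entries.
set.seed(5)
X <- matrix(rnorm(300 * 4), ncol = 4)
D <- sweep(X, 2, colMeans(X))              # centered data
E1 <- eigen(cov(X))$vectors[, 1]           # first loading
F1 <- D %*% E1                             # first factor scores
slopes <- apply(D, 2, function(col) coef(lm(col ~ F1))[2])
all.equal(unname(slopes), E1)
```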
AssignmentData.Centered<-t(apply(AssignmentData,1,function(AssignmentData.row) AssignmentData.row-PCA.Yields$center))
dim(AssignmentData.Centered)
## [1] 8300 7
t(apply(AssignmentData.Centered, 2, function(AssignmentData.col) lm(AssignmentData.Output~AssignmentData.col)$coef))
## (Intercept) AssignmentData.col
## USGG3M 1.420077e-11 2.507561
## USGG6M 1.421187e-11 2.497235
## USGG2YR 1.419747e-11 2.400449
## USGG3YR 1.419989e-11 2.455793
## USGG5YR 1.419549e-11 2.568742
## USGG10YR 1.420297e-11 2.786991
## USGG30YR 1.420965e-11 3.069561
To recover the loading of the first factor by doing regression, use all inputs together.
t(lm(AssignmentData.Output~AssignmentData.Centered)$coef)[-1]
## [1] 0.3839609 0.3901870 0.4151851 0.4063541 0.3860610 0.3477544 0.3047124
PCA.Yields$loadings[,1]
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR
## 0.3839609 0.3901870 0.4151851 0.4063541 0.3860610 0.3477544 0.3047124
This means that the first factor is a portfolio of all input variables, with weights given by the first loading vector.
PCA.Yields$loadings[,1]
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR
## 0.3839609 0.3901870 0.4151851 0.4063541 0.3860610 0.3477544 0.3047124
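The "portfolio" interpretation can be verified directly on synthetic data (a sketch, not the assignment set): multiplying the centered inputs by the first loading vector reproduces the first principal-component scores from princomp().

```r
# PC1 scores are a weighted combination of the centered input columns.
set.seed(8)
X <- matrix(rnorm(100 * 5), ncol = 5)
pc <- princomp(X)
D <- sweep(X, 2, pc$center)                # centered inputs
f1 <- D %*% pc$loadings[, 1]               # portfolio with loading weights
all.equal(unname(f1[, 1]), unname(pc$scores[, 1]))
```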